Abstract

Context-free language theory is a well-established area of mathematics, relevant to the foundations
of computer science and to its technology. This paper presents the preliminary results of an ongoing
formalization project using context-free grammars and the Coq proof assistant. The results obtained
so far include the representation of context-free grammars, the description of algorithms for some
operations on them (union, concatenation and closure) and the proofs of related theorems (e.g. the
correctness of these algorithms). A brief survey of related work is presented, as well as plans for
further development.

1 Introduction

The fundamental mathematical activity of stating and proving theorems has
traditionally been carried out by professionals who rely purely on their own personal
efforts to accept or reject any new proposal, after extensive checking. This style of
work, which has been used for centuries by mathematicians all over the world, is
now changing thanks to the support of computer technology.

So-called “proof assistants” are software tools that run on ordinary computers
and offer a friendly, fully checked environment in which one can describe
mathematical objects and properties and then prove theorems about them. Their
main advantage is that the validation of these proofs is automated.
When applied to program development, these tools also help to check the
correctness of an existing program and to construct correct programs. To
obtain these benefits, however, one must first become familiar with the
mathematical theory underlying each of these tools.

Language theory is a well-established area in mathematics and computer science,
which was extensively developed during the 1960s and 1970s. Automata theory
developed alongside it, and since the 1960s the two areas have generally been regarded
as a single discipline. Fundamental to the study and development of computer languages,
as well as of language processing technology, the theory also leads to important
conclusions about the limits and properties of the computation process itself.

New and different uses of Coq and other proof assistants are announced
frequently. These include, for example, the proof of the Four Color Theorem by
Georges Gonthier and Benjamin Werner at Microsoft Research in 2005 [17] and
the proof of the Feit-Thompson Theorem by a group led by Georges
Gonthier in 2012 [18]. There are also important projects in the areas of mathematics
[19], compiler certification [27] and digital security certification [16], among others
[22], [21].

The idea of formalizing context-free language theory in the Coq proof assistant
is discussed in Section 2. In particular, we present the goals that have been set, the
strategy adopted and the results obtained so far. Section 3 presents the plan for
the rest of this research, and Section 4 considers related work by other
researchers.


2 Formalization 

The objective of this work is to formalize a substantial part of context-free language 
theory in the Coq proof assistant, making it possible to reason about it in a fully 
checked environment, with all the related advantages. Initially, however, the focus 
has been restricted to context-free grammars and associated results. Stack automata 
and their relation to context-free grammars shall be considered in the future. 

More information on the Coq proof assistant, as well as on the syntax and
semantics of the following definitions and statements, can be found in [20], [12] and
[7].

The motivation for this work comes from (i) the large amount of formalization
already existing for regular language theory; (ii) the apparent absence of a
similar formalization effort for context-free language theory, at least in the Coq proof
assistant; and (iii) the high interest in context-free language theory formalization
as a result of its practical importance in computer technology (e.g. correctness of
language processing software). More information on related work is provided in
Section 4.

Context-free grammars have been represented in Coq very closely to the usual
algebraic definition $G = (V, \Sigma, P, S)$, where $\Sigma$ is the set of terminal symbols (used
in the construction of the sentences of the language generated by the grammar),
$N = V - \Sigma$ is the set of non-terminal symbols (representing different sentence
abstractions), $P$ is the set of rules and $S \in N$ is the start symbol (also called initial
or root symbol). Rules have the form $\alpha \to \beta$, with $\alpha \in N$ and $\beta \in V^*$. The following
record representation has been used:

Record cfg: Type:= {
  non_terminal: Type;
  terminal: Type;
  start_symbol: non_terminal;
  sf:= list (non_terminal + terminal);
  rules: non_terminal -> sf -> Prop
}.

The definition above states that cfg is a new type with four components.
The first is non_terminal, which represents the set of non-terminal symbols of
the grammar; the second is terminal, representing the set of terminal symbols; the
third is start_symbol; and the fourth is rules, which represents the rules of the
grammar. Rules are propositions (represented in Coq by Prop) that take as arguments
a non-terminal symbol and a (possibly empty) list of non-terminal and terminal
symbols (corresponding, respectively, to the left-hand and right-hand sides of a rule).
The auxiliary definition sf (sentential form) abbreviates the type of lists of terminal
and non-terminal symbols.
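
As an illustration that is not part of the original development, the record can be
instantiated with a small grammar for the language a^n b^n. The names ab_nt, ab_t, S0,
R_empty, R_wrap and ab_grammar are hypothetical and chosen only for this sketch; the
record definition is repeated so that the fragment stands alone.

```coq
Require Import List.
Import ListNotations.

(* Record from the text, repeated so that the sketch is self-contained. *)
Record cfg: Type:= {
  non_terminal: Type;
  terminal: Type;
  start_symbol: non_terminal;
  sf:= list (non_terminal + terminal);
  rules: non_terminal -> sf -> Prop
}.

(* Hypothetical symbols: one non-terminal S0 and two terminals a, b. *)
Inductive ab_nt: Type := S0.
Inductive ab_t: Type := a | b.

(* The two rules S0 -> (empty) and S0 -> a S0 b, as an inductive predicate. *)
Inductive ab_rules: ab_nt -> list (ab_nt + ab_t) -> Prop :=
| R_empty: ab_rules S0 []
| R_wrap: ab_rules S0 [inr a; inl S0; inr b].

(* The field sf is computed from the other fields, so it is not supplied here. *)
Definition ab_grammar: cfg := {|
  non_terminal:= ab_nt;
  terminal:= ab_t;
  start_symbol:= S0;
  rules:= ab_rules
|}.
```

Non-terminals are injected into sentential forms with inl and terminals with inr, which
is why the right-hand side a S0 b is written [inr a; inl S0; inr b].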

Another fundamental concept used in this formalization is the idea of derivation:
a grammar g derives a string s2 from a string s1 if there exists a series of rules in g
that, when applied to s1, eventually results in s2. An inductive predicate definition
of this concept in Coq uses two constructors:

Inductive derives (g: cfg): sf g -> sf g -> Prop :=
| derives_refl: forall s: sf g,
    derives g s s
| derives_step: forall s1 s2 s3: sf g,
    forall left: non_terminal g,
    forall right: sf g,
    derives g s1 (s2 ++ inl left :: s3)%list ->
    rules g left right ->
    derives g s1 (s2 ++ right ++ s3)%list.


The constructors of this definition (derives_refl and derives_step) are the
axioms of our theory. Constructor derives_refl asserts that every sentential form
s can be derived from s itself. Constructor derives_step states that if a sentential
form that contains the left-hand side of a rule is derived by a grammar, then the
grammar also derives the sentential form in which that left-hand side is replaced by the
right-hand side of the same rule. This case corresponds to the application of a rule in a
direct derivation step.

Finally, a grammar generates a string if this string can be derived from its root
symbol:

Definition generates (g: cfg) (s: sf g): Prop:=
  derives g [inl (start_symbol g)] s.

With these definitions, it has been possible to prove lemmas and also to
implement functions that operate on grammars, all of which were useful when proving
the main theorems.
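
The following fragment is a minimal sketch of how derives and generates can be used on a
concrete grammar. It is not taken from the original proof scripts: the grammar ab_grammar
(with rules S0 -> empty and S0 -> a S0 b) and every name prefixed with ab_ or R_ are
assumptions made for the example, and the definitions from the text are repeated so that
the fragment is self-contained.

```coq
Require Import List.
Import ListNotations.

Record cfg: Type:= {
  non_terminal: Type;
  terminal: Type;
  start_symbol: non_terminal;
  sf:= list (non_terminal + terminal);
  rules: non_terminal -> sf -> Prop
}.

Inductive derives (g: cfg): sf g -> sf g -> Prop :=
| derives_refl: forall s: sf g, derives g s s
| derives_step: forall s1 s2 s3: sf g,
    forall left: non_terminal g,
    forall right: sf g,
    derives g s1 (s2 ++ inl left :: s3)%list ->
    rules g left right ->
    derives g s1 (s2 ++ right ++ s3)%list.

Definition generates (g: cfg) (s: sf g): Prop:=
  derives g [inl (start_symbol g)] s.

(* Hypothetical example grammar for a^n b^n. *)
Inductive ab_nt: Type := S0.
Inductive ab_t: Type := a | b.

Inductive ab_rules: ab_nt -> list (ab_nt + ab_t) -> Prop :=
| R_empty: ab_rules S0 []
| R_wrap: ab_rules S0 [inr a; inl S0; inr b].

Definition ab_grammar: cfg := {|
  non_terminal:= ab_nt;
  terminal:= ab_t;
  start_symbol:= S0;
  rules:= ab_rules
|}.

(* The derivation S0 => a S0 b => a b, built from the last step backwards. *)
Example ab_generates_ab: generates ab_grammar [inr a; inr b].
Proof.
  unfold generates.
  (* Last step: rule S0 -> [] applied inside a S0 b (s2 = [a], s3 = [b]). *)
  apply (derives_step ab_grammar [inl S0] [inr a] [inr b] S0 []).
  - (* First step: rule S0 -> a S0 b applied to the start symbol alone. *)
    apply (derives_step ab_grammar [inl S0] [] [] S0 [inr a; inl S0; inr b]).
    + exact (derives_refl ab_grammar [inl S0]).
    + exact R_wrap.
  - exact R_empty.
Qed.
```

Because derives_step extends a derivation at its end, the proof is read bottom-up: the
outermost application corresponds to the last rule used in the derivation.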

After context-free grammars and derivations were defined, the basic operations
of concatenation, union and closure of context-free grammars were implemented in
a rather straightforward way. These operations provide, as their names suggest, new
context-free grammars that generate, respectively, the concatenation, the union and
the closure of the language(s) generated by the input grammar(s). The code for
these operations is presented below:

Union: 

Definition g_uni_t (g1 g2: cfg): Type:=
  (terminal g1 + terminal g2)%type.

Inductive g_uni_nt (g1 g2: cfg): Type :=
| Start_uni : g_uni_nt g1 g2
| Transf1_uni : non_terminal g1 -> g_uni_nt g1 g2
| Transf2_uni : non_terminal g2 -> g_uni_nt g1 g2.

Definition g_uni_sf_lift_left (g1 g2: cfg)
  (c: non_terminal g1 + terminal g1): g_uni_nt g1 g2 + g_uni_t g1 g2:=
  match c with
  | inl nt => inl (Transf1_uni g1 g2 nt)
  | inr t => inr (inl t)
  end.

Definition g_uni_sf_lift_right (g1 g2: cfg)
  (c: non_terminal g2 + terminal g2): g_uni_nt g1 g2 + g_uni_t g1 g2:=
  match c with
  | inl nt => inl (Transf2_uni g1 g2 nt)
  | inr t => inr (inr t)
  end.

Inductive g_uni_rules (g1 g2: cfg): g_uni_nt g1 g2 ->
  list (g_uni_nt g1 g2 + g_uni_t g1 g2) -> Prop :=
| Start1_uni: g_uni_rules g1 g2 (Start_uni g1 g2)
    [inl (Transf1_uni g1 g2 (start_symbol g1))]
| Start2_uni: g_uni_rules g1 g2 (Start_uni g1 g2)
    [inl (Transf2_uni g1 g2 (start_symbol g2))]
| Lift1_uni: forall nt s,
    rules g1 nt s ->
    g_uni_rules g1 g2 (Transf1_uni g1 g2 nt)
      (map (g_uni_sf_lift_left g1 g2) s)
| Lift2_uni: forall nt s,
    rules g2 nt s ->
    g_uni_rules g1 g2 (Transf2_uni g1 g2 nt)
      (map (g_uni_sf_lift_right g1 g2) s).

Definition g_uni (g1 g2: cfg): cfg := {|
  non_terminal:= g_uni_nt g1 g2;
  terminal:= g_uni_t g1 g2;
  start_symbol:= Start_uni g1 g2;
  rules:= g_uni_rules g1 g2
|}.

The first definition above (g_uni_t) represents the type of the terminal symbols 
of the union grammar, created from the terminal symbols of the source grammars. 
Basically, it states that the terminals of both source grammars become terminals in 
the union grammar, by means of a disjoint union operation. For the non-terminal 
symbols, a more complex statement is required. First, the non-terminals of the 
source grammars are mapped to non-terminals of the union grammar. Second, there 
is the need to add a new and unique non-terminal symbol (Start_uni), which will 
be the root of the union grammar. This is accomplished by the use of an inductive
type definition (g_uni_nt), in contrast with the previous case, which used a simple
non-inductive definition.

The functions g_uni_sf_lift_left and g_uni_sf_lift_right simply map the symbols
of, respectively, the first or the second grammar of a pair to symbols of the union
grammar; applied with map, they turn sentential forms of a source grammar into
sentential forms of the union grammar. This will be useful when defining the
rules of the union grammar.
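
To make the effect of the lifting concrete, the sketch below applies g_uni_sf_lift_left,
through map, to one sentential form. The grammar ab_grammar and its symbols are
hypothetical names introduced only for this example, the same grammar is used on both
sides of the union for brevity, and only the definitions needed for the lifting are repeated.

```coq
Require Import List.
Import ListNotations.

Record cfg: Type:= {
  non_terminal: Type;
  terminal: Type;
  start_symbol: non_terminal;
  sf:= list (non_terminal + terminal);
  rules: non_terminal -> sf -> Prop
}.

Definition g_uni_t (g1 g2: cfg): Type:=
  (terminal g1 + terminal g2)%type.

Inductive g_uni_nt (g1 g2: cfg): Type :=
| Start_uni : g_uni_nt g1 g2
| Transf1_uni : non_terminal g1 -> g_uni_nt g1 g2
| Transf2_uni : non_terminal g2 -> g_uni_nt g1 g2.

Definition g_uni_sf_lift_left (g1 g2: cfg)
  (c: non_terminal g1 + terminal g1): g_uni_nt g1 g2 + g_uni_t g1 g2:=
  match c with
  | inl nt => inl (Transf1_uni g1 g2 nt)
  | inr t => inr (inl t)
  end.

(* Hypothetical example grammar for a^n b^n. *)
Inductive ab_nt: Type := S0.
Inductive ab_t: Type := a | b.

Inductive ab_rules: ab_nt -> list (ab_nt + ab_t) -> Prop :=
| R_empty: ab_rules S0 []
| R_wrap: ab_rules S0 [inr a; inl S0; inr b].

Definition ab_grammar: cfg := {|
  non_terminal:= ab_nt;
  terminal:= ab_t;
  start_symbol:= S0;
  rules:= ab_rules
|}.

(* Lifting a S0 b: terminals are tagged with inl (they come from the first
   grammar) and the non-terminal is wrapped in Transf1_uni. *)
Example lift_left_example:
  map (g_uni_sf_lift_left ab_grammar ab_grammar) [inr a; inl S0; inr b]
  = [inr (inl a); inl (Transf1_uni ab_grammar ab_grammar S0); inr (inl b)].
Proof. reflexivity. Qed.
```

The lifting changes only the types of the symbols, not the shape of the sentential form,
which is what allows the rules of a source grammar to be carried over unchanged.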

The rules of the union grammar are represented by the inductive definition
g_uni_rules. Constructors Start1_uni and Start2_uni state that two new rules
are added to the union grammar: respectively, the rule that maps the new root to
the root of the first grammar, and the rule that maps the new root to the root of the
second grammar. Then, constructors Lift1_uni and Lift2_uni simply map the rules
of the first (resp. second) grammar to rules of the union grammar.

Finally, g_uni describes how to create a union grammar from two arbitrary source
grammars. It uses the previous definitions to give values to each of the components
of a new grammar definition.

Similar definitions were created to represent the concatenation of any two
grammars and the closure of a grammar:

Concatenation: 

Definition g_cat_t (g1 g2: cfg): Type:=
  (terminal g1 + terminal g2)%type.

Inductive g_cat_nt (g1 g2: cfg): Type :=
| Start_cat : g_cat_nt g1 g2
| Transf1_cat : non_terminal g1 -> g_cat_nt g1 g2
| Transf2_cat : non_terminal g2 -> g_cat_nt g1 g2.

Definition g_cat_sf_lift_left (g1 g2: cfg)
  (c: non_terminal g1 + terminal g1): g_cat_nt g1 g2 + g_cat_t g1 g2:=
  match c with
  | inl nt => inl (Transf1_cat g1 g2 nt)
  | inr t => inr (inl t)
  end.

Definition g_cat_sf_lift_right (g1 g2: cfg)
  (c: non_terminal g2 + terminal g2): g_cat_nt g1 g2 + g_cat_t g1 g2:=
  match c with
  | inl nt => inl (Transf2_cat g1 g2 nt)
  | inr t => inr (inr t)
  end.

Inductive g_cat_rules (g1 g2: cfg): g_cat_nt g1 g2 ->
  list (g_cat_nt g1 g2 + g_cat_t g1 g2) -> Prop :=
| New_cat: g_cat_rules g1 g2 (Start_cat g1 g2)
    ([inl (Transf1_cat g1 g2 (start_symbol g1))]++
     [inl (Transf2_cat g1 g2 (start_symbol g2))])%list
| Lift1_cat: forall nt s,
    rules g1 nt s ->
    g_cat_rules g1 g2 (Transf1_cat g1 g2 nt)
      (map (g_cat_sf_lift_left g1 g2) s)
| Lift2_cat: forall nt s,
    rules g2 nt s ->
    g_cat_rules g1 g2 (Transf2_cat g1 g2 nt)
      (map (g_cat_sf_lift_right g1 g2) s).

Definition g_cat (g1 g2: cfg): cfg := {|
  non_terminal:= g_cat_nt g1 g2;
  terminal:= g_cat_t g1 g2;
  start_symbol:= Start_cat g1 g2;
  rules:= g_cat_rules g1 g2
|}.

Closure: 

Definition g_clo_t (g: cfg): Type:=
  (terminal g)%type.

Inductive g_clo_nt (g: cfg): Type :=
| Start_clo : g_clo_nt g
| Transf_clo : non_terminal g -> g_clo_nt g.

Definition g_clo_sf_lift (g: cfg)
  (c: non_terminal g + terminal g): g_clo_nt g + g_clo_t g:=
  match c with
  | inl nt => inl (Transf_clo g nt)
  | inr t => inr t
  end.

Inductive g_clo_rules (g: cfg): g_clo_nt g ->
  list (g_clo_nt g + g_clo_t g) -> Prop :=
| New1_clo: g_clo_rules g (Start_clo g)
    ([inl (Start_clo g)]++
     [inl (Transf_clo g (start_symbol g))])
| New2_clo: g_clo_rules g (Start_clo g) []
| Lift_clo: forall nt s,
    rules g nt s ->
    g_clo_rules g (Transf_clo g nt) (map (g_clo_sf_lift g) s).

Definition g_clo (g: cfg): cfg := {|
  non_terminal:= g_clo_nt g;
  terminal:= g_clo_t g;
  start_symbol:= Start_clo g;
  rules:= g_clo_rules g
|}.

Although these definitions are simple in structure, it must be proved that g_uni,
g_cat and g_clo always produce the correct result. In other words, the algorithms
embedded in these definitions must be “certified”. The process of doing such a
certification is called “program verification”, and is one of the main goals of formalization.
In order to accomplish this, we must first state theorems, using first-order logic,
that capture the expected semantics of these definitions. Finally, we have to derive
proofs of these theorems.

This can be done with a pair of theorems for each definition/algorithm: the 
first relates the output to the inputs, and the other one does the inverse, providing 
assumptions about the inputs once an output is generated. This is necessary in order 
to guarantee that the algorithm does only what one would expect, and no more. 

Concatenation, direct operation:

Theorem g_cat_correct (g1 g2: cfg)(s1: sf g1)(s2: sf g2):
  generates g1 s1 /\ generates g2 s2 ->
  generates (g_cat g1 g2)
    ((map (g_cat_sf_lift_left g1 g2) s1)++
     (map (g_cat_sf_lift_right g1 g2) s2))%list.

The above theorem, for example, states that if context-free grammars g1 and
g2 generate, respectively, strings s1 and s2, then the concatenation of these two
grammars, according to the proposed algorithm, generates the concatenation of
string s1 with string s2. As mentioned before, the above theorem alone does not
guarantee that the grammar built by g_cat generates nothing other than concatenations
of strings generated by its input grammars. This idea is captured by the following
complementary theorem:

Concatenation, inverse operation:

Theorem g_cat_correct_inv (g1 g2: cfg)(s: sf (g_cat g1 g2)):
  generates (g_cat g1 g2) s ->
  exists s1: sf g1,
  exists s2: sf g2,
    s =(map (g_cat_sf_lift_left g1 g2) s1)++
       (map (g_cat_sf_lift_right g1 g2) s2) /\
    generates g1 s1 /\
    generates g2 s2.

The idea here is to express that, if a string is generated by g_cat, then it must 
only result from the concatenation of strings generated by the grammars merged by 
the algorithm. Together, these two theorems represent the semantics of the context- 
free grammar concatenation operation presented. The same ideas have been applied 
to the statement and proof of the following theorems, relative to the union and 
closure operations: 

Union, direct operation: 

Theorem g_uni_correct (g1 g2: cfg)(s1: sf g1)(s2: sf g2):
  generates g1 s1 \/ generates g2 s2 ->
  generates (g_uni g1 g2) (map (g_uni_sf_lift_left g1 g2) s1) \/
  generates (g_uni g1 g2) (map (g_uni_sf_lift_right g1 g2) s2).

Union, inverse operation: 

Theorem g_uni_correct_inv (g1 g2: cfg)(s: sf (g_uni g1 g2)):
  generates (g_uni g1 g2) s ->
  (s=[inl (start_symbol (g_uni g1 g2))]) \/
  (exists s1: sf g1,
    (s=(map (g_uni_sf_lift_left g1 g2) s1) /\ generates g1 s1)) \/
  (exists s2: sf g2,
    (s=(map (g_uni_sf_lift_right g1 g2) s2) /\ generates g2 s2)).
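
Read as statements about sets of sentential forms, the intended content of the two union
theorems can be summarized as below. The notation is introduced here only for
illustration and does not appear in the development: $\mathcal{D}(g)$ stands for the set of
sentential forms $s$ such that generates g s holds, $\ell_1$ and $\ell_2$ stand for
map (g_uni_sf_lift_left g1 g2) and map (g_uni_sf_lift_right g1 g2), $g_\cup$ stands for
g_uni g1 g2 and $S_\cup$ for its start symbol Start_uni.

```latex
\begin{align*}
\text{(direct)}  \quad & \ell_1(\mathcal{D}(g_1)) \cup \ell_2(\mathcal{D}(g_2))
                         \subseteq \mathcal{D}(g_\cup) \\
\text{(inverse)} \quad & \mathcal{D}(g_\cup) \subseteq
                         \{[S_\cup]\} \cup \ell_1(\mathcal{D}(g_1)) \cup \ell_2(\mathcal{D}(g_2))
\end{align*}
```

The extra element $[S_\cup]$ in the inverse inclusion appears because derives is reflexive,
so the start symbol of the union grammar counts as generated even though it is not the
image of any sentential form of a source grammar.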

Closure, direct operation: 

Theorem g_clo_correct (g: cfg)(s: sf g)(s': sf (g_clo g)):
  generates (g_clo g) nil /\
  (generates (g_clo g) s' /\ generates g s ->
   generates (g_clo g) (s' ++ map (g_clo_sf_lift g) s)).

Closure, inverse operation: 

Theorem g_clo_correct_inv (g: cfg)(s: sf (g_clo g)):
  generates (g_clo g) s ->
  (s=[]) \/
  (s=[inl (start_symbol (g_clo g))]) \/
  (exists s': sf (g_clo g),
   exists s'': sf g,
     generates (g_clo g) s' /\ generates g s'' /\
     s=s'++map (g_clo_sf_lift g) s'').

The proofs of all six main theorems have been completed (g_uni_correct and
g_uni_correct_inv for union, g_cat_correct and g_cat_correct_inv for
concatenation, and g_clo_correct and g_clo_correct_inv for closure). As an interesting
side result, some useful and generic lemmas have also been proved during this
process. Among these is, for example, one that asserts the context-free character of
derivations:

Theorem derives_context_free_add:
  forall g: cfg,
  forall s1 s2 s s': sf g,
  derives g s1 s2 -> derives g (s++s1++s') (s++s2++s').

and one that states the transitivity of the derives relation: 

Theorem derives_trans:
  forall g: cfg,
  forall s1 s2 s3: sf g,
  derives g s1 s2 ->
  derives g s2 s3 ->
  derives g s1 s3.
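
As a sketch of how such a lemma can be obtained, the fragment below gives one possible
proof of transitivity by induction on the second derivation. It is an illustration written
for this text and is not necessarily the script used in the original development; the
definitions of cfg and derives are repeated so that the fragment is self-contained, and the
hypothesis names (H12, H23, IH, Hrule and so on) are arbitrary.

```coq
Require Import List.
Import ListNotations.

Record cfg: Type:= {
  non_terminal: Type;
  terminal: Type;
  start_symbol: non_terminal;
  sf:= list (non_terminal + terminal);
  rules: non_terminal -> sf -> Prop
}.

Inductive derives (g: cfg): sf g -> sf g -> Prop :=
| derives_refl: forall s: sf g, derives g s s
| derives_step: forall s1 s2 s3: sf g,
    forall left: non_terminal g,
    forall right: sf g,
    derives g s1 (s2 ++ inl left :: s3)%list ->
    rules g left right ->
    derives g s1 (s2 ++ right ++ s3)%list.

Theorem derives_trans (g: cfg) (s1 s2 s3: sf g):
  derives g s1 s2 -> derives g s2 s3 -> derives g s1 s3.
Proof.
  intros H12 H23.
  (* Keep the first derivation in the goal so that the induction
     hypothesis is stated for it as well. *)
  revert H12.
  induction H23 as [s | sa sb sc lhs rhs Hder IH Hrule]; intros H12.
  - (* derives_refl: the second derivation is empty. *)
    exact H12.
  - (* derives_step: extend the combined derivation by the same last rule. *)
    apply (derives_step g s1 sb sc lhs rhs).
    + exact (IH H12).
    + exact Hrule.
Qed.
```

Induction on the second derivation fits the shape of derives_step, which adds a rule
application at the end of a derivation; induction on the first derivation would instead
require a lemma for adding a step at the beginning.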

All the definitions and proof scripts were written in plain Coq using CoqIDE (a
graphical interface for Windows), and are available for download at:
http://WWW.univasf.edu.br/~marcus.ramos/coq/cfg-closure.v.

Basically, the proof scripts use induction on the predicate derives and also direct 
list manipulation. The libraries used were Ascii, String and List. 

3 Further Work 

The proper specification of inductive predicates and related definitions leads
naturally to simple functions that perform the necessary transformations on the objects
described, and also to readable statements of lemmas and theorems.

The current work concentrated on the formalization of context-free grammars, 
derivations and closure operations, as well as on the certification of the correctness 
of these operations. The plan now is to use the same definitions to achieve the 
following goals: 

(i) Describe algorithms for the simplification of context-free grammars (namely 
elimination of inaccessible and useless symbols, unit and empty productions) 
and prove their correctness; 

(ii) Similarly for the construction of Chomsky and Greibach normal forms for 
context-free grammars; 

(iii) Prove the decidability of some questions on context-free languages, especially those
whose proofs rely on context-free grammars;

(iv) Finally, obtain a formal proof of the Pumping Lemma for context-free
languages, and use it to prove the existence of non-context-free languages.

All the theory and results related to stack automata, and also to the relation of 
stack automata to context-free grammars, shall be left for future work. 


4 Related Work 

Language and automata theory has been a subject of formalization since the
mid-1980s, when Kreitz used the Nuprl proof assistant to prove results about
deterministic finite automata and the pumping lemma for regular languages [26]. Since then,
the theory of regular languages has been formalized partially by different researchers
using different proof assistants (see [11], [24], [15], [10], [28], [30], [2], [1], [29], [8],
[9], [3], [13], [25] and [33]). The most recent and complete formalization, however,
is the work by Jan-Oliver Kaiser [14], which used Coq and the SSReflect extension
to prove the main results of regular language theory.

Context-free language theory has not been formalized to the same extent so far.
The most relevant works are the ones published by Jourdan, Pottier and Leroy
(using Coq, [23]) and Ridge (HOL4, [32]), on parser generation and validation, and
Norrish and Barthwal (HOL4, [4], [5], [6]), on theory formalization.

When it comes to computability theory and the classes of languages related to
Turing machines, formalization has been approached by Asperti and Ricciotti (Matita, [3]),
Xu, Zhang and Urban (Isabelle/HOL, [34]) and Norrish (HOL4, [31]).


5 Conclusions 

The present paper reports an ongoing research effort towards formalizing classical
context-free language theory, initially based only on context-free grammars, in the
Coq proof assistant. All important objects have already been described and basic
closure operations on grammars have already been implemented. Proofs of the
correctness of the concatenation, union and closure operations (in both the direct and
the inverse direction) were constructed.

When the work is complete, it should be useful for a few different purposes.
First, it should offer a complete and mathematically precise description of the
behavior of the objects of context-free language theory. Second, it should offer fully
checked and mechanized proofs of its main results. Third, it should allow for the
certified and efficient implementation of its relevant algorithms in a proper programming
language. Fourth, it should permit experimentation in an educational environment,
in the form of a tool set used in a laboratory where further practical observations and
developments can be carried out, for the benefit of students, teachers, professionals and
researchers.